1. PREPARE

The transition to digital learning has made available new sources of data, providing researchers new opportunities for understanding and improving STEM learning. Data sources such as digital learning environments and administrative data systems, as well as data produced by social media websites and the mass digitization of academic and practitioner publications, hold enormous potential to address a range of pressing problems in STEM Education, but collecting and analyzing text-based data also presents unique challenges.

Text Mining (TM) Module 1: Public Sentiment and the State Standards will help demonstrate how text mining can be applied in STEM education research and provide LASER Institute scholars hands-on experience with popular techniques for collecting, processing, and analyzing text-based data. Specifically, the four learning labs that make up this module address the following topics:

1a. Review the Literature

Text Mining Module 1 is guided by a recent publication by Rosenberg et al. (2020), Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards. This study in turn builds upon previous work by Wang & Fikis (2017) examining public opinion on the Common Core State Standards (CCSS) on Twitter. For Module 1, we will focus on analyzing tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand key words and phrases that emerge, as well as public sentiment towards these two curriculum reform efforts.

Twitter and the Next Generation Science Standards

Full Paper (Preprint)

Abstract

While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them. To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users. Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive – 33 times more so than for the CCSS. Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.

Data Sources

Similar to what we’ll be learning in this lab, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the rtweet package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, as well as all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss,” “next generation science standard/s,” “next gen science standard/s.”

Data used in this lab was pulled using an Academic Research developer account and the {academictwitteR} package, which uses the Twitter API v2 endpoints and allows researchers to access the full Twitter archive. For those who created a standard developer account, the rtweet & the Twitter API supplemental learning lab will show you how to pull your own data from Twitter.

Data for this lab includes all tweets from 2020 that included the following terms: #ccss, common core, #ngsschat, ngss. Below is an example of the code used to retrieve data for this lab. It is set not to run and will not run if you try, but it does illustrate the search query used, variables selected, and time frame.

ccss_tweets_2021 <-
  get_all_tweets('(#commoncore OR "common core") -is:retweet lang:en',
                 "2021-01-01T00:00:00Z",
                 "2021-05-31T00:00:00Z",
                 bearer_token,
                 data_path = "ccss-data/",
                 bind_tweets = FALSE)

ccss_tweets <- bind_tweet_jsons(data_path = "ccss-data/") %>%
  select(text,
         created_at,
         author_id,
         id,
         conversation_id,
         source,
         possibly_sensitive,
         in_reply_to_user_id)


write_csv(ccss_tweets, "data/ccss-tweets.csv")

Analysis

Also similar to what we’ll demonstrate in Lab 3, the authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2017) used it to explore the sentiment of CCSS-related posts. We’ll be using the AFINN sentiment lexicon, which scores individual words on a single scale from -5 (most negative) to 5 (most positive), in addition to exploring some other sentiment lexicons to see if they produce similar results.

The authors also used the lme4 package in R to run a mixed effects model to determine if sentiment changes over time and differs between teachers and non-teachers. We won’t look at the relationships between tweet sentiment, time, and teachers in these labs, but we will take a look at the correlation between words within tweets in TM Learning Lab 2.
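To make the idea of lexicon-based scoring more concrete, here is a minimal sketch of how word scores can be summed into a tweet-level sentiment score. It assumes a tiny made-up lexicon (`toy_lexicon`) standing in for AFINN and two invented tweets; the words, values, and object names are illustrative only and are not taken from the study or from the real lexicon.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical mini-lexicon standing in for AFINN (values invented for illustration)
toy_lexicon <- tribble(
  ~word,       ~value,
  "love",       3,
  "great",      3,
  "confusing", -2,
  "hate",      -3
)

# Two invented tweets
toy_tweets <- tibble(
  id   = c("1", "2"),
  text = c("Love the new NGSS lessons, great work!",
           "I hate this confusing homework")
)

tweet_scores <- toy_tweets %>%
  unnest_tokens(output = word, input = text) %>%  # one lowercase word per row
  inner_join(toy_lexicon, by = "word") %>%        # keep only words that have a score
  group_by(id) %>%
  summarise(sentiment = sum(value))               # add up word scores per tweet

tweet_scores
```

Words that are missing from the lexicon simply drop out of the join, which is one reason different lexicons can produce different results for the same tweets.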

Summary of Key Findings

  1. Contrasting with sentiment about CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
  2. Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
  3. Differences between the context of the tweets were small, but those that did not include the #NGSSchat hashtag became more positive over time than those posts that did include the hashtag.
  4. When individuals posted more tweets during #NGSSchat chats, the sentiment of their posts was more positive, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in the #NGSSchat was positive.

Finally, you can watch Dr. Rosenberg provide a quick 3-minute overview of this work at <https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x>

1b. Define Questions

One overarching question that Silge and Robinson (2017) identify as central to text mining and natural language processing, and that we’ll explore throughout the text mining labs this year, is:

How do we quantify what a document or collection of documents is about?

The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:

  1. What is the public sentiment expressed toward the NGSS?
  2. How does sentiment for teachers differ from non-teachers?
  3. How do tweets posted to #NGSSchat differ from those without the hashtag?
  4. How does participation in #NGSSchat relate to the public sentiment individuals express?
  5. How does public sentiment vary over time?

For our first lab on text mining in STEM education, we’ll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in the next four learning labs we’ll attempt to answer the following questions:

  1. What are the most frequent words or phrases used in tweets about the CCSS and NGSS?
  2. What words and hashtags commonly occur together?
  3. How does sentiment for NGSS compare to sentiment for CCSS?

1c. Load Libraries

tidyverse 📦

As noted in our Getting Started activity, R uses “packages,” add-ons that enhance its functionality. One package that we’ll be using extensively is {tidyverse}. The {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes referred to as “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.1.2     ✓ dplyr   1.0.6
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.1
── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Again, don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded have the same name as functions in other packages, and R will default to the function from the most recently loaded package unless you specify otherwise.
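As a small illustration of what “specifying otherwise” looks like, the sketch below calls both versions of filter() by their full `package::function` names. The `scores` data frame and the numeric series are invented for this example.

```r
library(dplyr)  # after this, a bare filter() refers to dplyr::filter()

# Invented data for illustration
scores <- data.frame(standards = c("ccss", "ngss", "ngss"),
                     n = c(5, 2, 9))

# dplyr's filter() keeps the rows that satisfy a condition
ngss_rows <- dplyr::filter(scores, standards == "ngss")

# The masked stats version is still reachable with its package prefix;
# it applies a linear filter (here a 2-point moving average) to a numeric series
moving_avg <- stats::filter(c(1, 3, 5, 7), rep(1 / 2, 2))

ngss_rows
moving_avg
```

Using the prefix is optional when only one package is in play, but it removes any doubt about which function runs.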

tidytext 📦

As we’ll learn first hand in this module, using tidy data principles can also make many text mining tasks easier, more effective, and consistent with tools already in wide use. The {tidytext} package helps to convert text into data frames of individual words, making it easy to manipulate, summarize, and visualize text using familiar functions from the {tidyverse} collection of packages.

Let’s go ahead and load the {tidytext} package:

library(tidytext)

For a more comprehensive introduction to the tidytext package, we cannot recommend enough the free online book, Text Mining with R: A Tidy Approach (Silge & Robinson, 2017). If you’re interested in pursuing text analysis using R after the Summer Workshop, this will be a go-to reference.

2. WRANGLE

The importance of data wrangling, particularly when working with text, is difficult to overstate. Just as a refresher, wrangling involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). Learning Lab 2 will have a heavy emphasis on preparing text for analysis and in particular we’ll learn how to:

  1. Import Data. First we revisit the familiar read_csv() function for reading in our CCSS and NGSS tweets.
  2. Restructure Data. We focus on removing extraneous data using the select() and filter() functions from {dplyr}, and introduce two new functions for merging the data frames that we imported.
  3. Tidy Text. We introduce the {tidytext} package to “tidy” and tokenize our tweets in order to create our data frame for analysis, and revisit the concept of joins to remove “stop words” that don’t add much value to our analysis.

2a. Import and View Data

ccss_tweets <- read_csv("data/ccss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )

Note the addition of the col_types = argument for changing some of the column types to character strings because the numbers for those particular columns actually indicate identifiers for authors and tweets:

  • author_id = the author of the tweet

  • id = the unique id for each tweet

  • conversation_id = the unique id for each conversation thread

  • in_reply_to_user_id = the author of the tweet being replied to
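To see why these identifier columns are safer as character strings, consider the minimal base R sketch below. The two 19-digit ids are made up for illustration; the point is only that very long ids exceed the precision of R’s numeric type, which is what col_character() protects against.

```r
# Two distinct (made-up) 19-digit ids, stored as text
id_a <- "1234567890123456789"
id_b <- "1234567890123456790"

# As character strings the two ids stay distinct
ids_differ_as_text <- !identical(id_a, id_b)

# Coerced to numbers, both are rounded to the same double-precision value
ids_equal_as_numbers <- as.numeric(id_a) == as.numeric(id_b)

ids_differ_as_text
ids_equal_as_numbers
```

If ids like these were read in as numbers, counting unique tweets or joining on id could silently merge different tweets.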

Your Turn ⤵

RStudio Tip: Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.

Try using the “Import Dataset” feature in the upper right environment pane to import the NGSS tweets located in the data folder.

The code generated should look something like this:

ngss_tweets <- read_csv("data/ngss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )

Use the following code chunk to inspect your tweets using a function you’ve learned so far for viewing your data:

# your code here

2b. Restructure Data

Subset Tweets

As you may have noticed, we have more data than we need for our analysis and should probably pare it down to just what we’ll use.

First, since this is a family friendly learning lab, let’s use the filter() function introduced in previous labs to filter out rows containing “possibly sensitive” language:

ccss_tweets_1 <- ccss_tweets %>% 
  filter(possibly_sensitive == "FALSE")

Now let’s use the select() function to select the following columns from our new ccss_tweets_1 data frame:

  1. text containing the tweet which is our primary data source of interest
  2. author_id of the user who created the tweet
  3. created_at timestamp for examining changes in sentiment over time
  4. conversation_id for examining sentiment by conversations
  5. id for the unique reference id for each tweet and useful for counts

ccss_tweets_2 <- ccss_tweets_1 %>% 
  select(text,
         author_id,
         created_at, 
         conversation_id,
         id)

Your Turn ⤵

Note: The select() function will also reorder your columns based on the order in which you list them.

Use the code chunk below to reorder the columns to your liking and assign to ccss_tweets_3:

# your code here

Add & Relocate Columns

Finally, since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column to quickly identify the set of state standards with which each tweet is associated.

We’ll use the mutate() function to create a new variable called standards to label each tweet as “ccss”:

ccss_tweets_4 <- mutate(ccss_tweets_2, standards = "ccss")

colnames(ccss_tweets_4)
[1] "text"            "author_id"       "created_at"      "conversation_id" "id"             
[6] "standards"      

And just because it bothers me, I’m going to use the relocate() function to move the standards column to the first position so I can quickly see which standards the tweet is from:

ccss_tweets_5 <- relocate(ccss_tweets_4, standards)

colnames(ccss_tweets_5)
[1] "standards"       "text"            "author_id"       "created_at"      "conversation_id"
[6] "id"             

Again, we could also have used the select() function to reorder columns like so:

ccss_tweets_5 <- ccss_tweets_4 %>% 
  select(standards,
         text,
         author_id,
         created_at, 
         conversation_id,
         id)

colnames(ccss_tweets_5)
[1] "standards"       "text"            "author_id"       "created_at"      "conversation_id"
[6] "id"             

Before moving on to the NGSS standards, let’s use the %>% operator and rewrite the code from our wrangling so there is less redundancy and it is easier to read:

# Wrangle CCSS tweets in a single pipeline
ccss_tweets_clean <- ccss_tweets %>%
  filter(possibly_sensitive == "FALSE") %>%
  select(text, author_id, created_at, conversation_id, id) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

head(ccss_tweets_clean)
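One way to convince yourself that a pipeline like this does what you expect is to run it on a tiny data frame where you already know the answer. The sketch below uses an invented three-row `toy_tweets` data frame and a placeholder label, and writes the filter as `!possibly_sensitive`, which is an equivalent form when the column is a true logical.

```r
library(dplyr)

# Invented stand-in for an imported tweets data frame
toy_tweets <- data.frame(
  text = c("Common Core math again", "Loving #ngsschat tonight", "Rude tweet"),
  author_id = c("11", "22", "33"),
  created_at = c("2021-01-02", "2021-01-03", "2021-01-04"),
  id = c("101", "102", "103"),
  conversation_id = c("101", "102", "103"),
  source = c("web", "iphone", "web"),
  possibly_sensitive = c(FALSE, FALSE, TRUE)
)

toy_clean <- toy_tweets %>%
  filter(!possibly_sensitive) %>%                                # drop flagged rows
  select(text, author_id, created_at, conversation_id, id) %>%  # keep five columns
  mutate(standards = "toy") %>%                                 # add a label column
  relocate(standards)                                           # move it to the front

toy_clean
```

Here we expect two rows and six columns, with standards first, which is easy to confirm by eye.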

Your Turn

Recall from section 1b. Define Questions that we are interested in comparing word usage and public sentiment around both the Common Core and Next Gen Science Standards.

Create a new ngss_tweets_clean data frame consisting of the Next Generation Science Standards tweets we imported, using the code above as a guide.

# your code here

Try not to peek at the answer below unless you are having difficulty with your code.

Answer

ngss_tweets_clean <- ngss_tweets %>%
  filter(possibly_sensitive == "FALSE") %>%
  select(text, author_id, created_at, conversation_id, id) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

head(ngss_tweets_clean)

Merge Data Frames

Finally, let’s combine our CCSS and NGSS tweets into a single data frame by using the union() function from dplyr and simply supplying the data frames that you want to combine as arguments:

ss_tweets <- union(ccss_tweets_clean,
                   ngss_tweets_clean)

Note that when creating a “union” like this (i.e. stacking one data frame on top of another), you should have the same number of columns in each data frame and they should be in the exact same order.
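One detail worth knowing about union() is that it treats data frames as sets, so identical rows are kept only once. The sketch below, using two invented data frames, contrasts this with bind_rows() from {dplyr}, which simply stacks every row. Whether dropping duplicates is desirable depends on your data, so take this as an illustration rather than a recommendation.

```r
library(dplyr)

# Invented data: the second data frame contains one repeated row
ccss_toy <- data.frame(standards = "ccss", text = c("tweet a", "tweet b"))
ngss_toy <- data.frame(standards = "ngss", text = c("tweet c", "tweet c"))

stacked_union <- dplyr::union(ccss_toy, ngss_toy)  # duplicate rows are dropped
stacked_bind  <- bind_rows(ccss_toy, ngss_toy)     # every row is kept

nrow(stacked_union)
nrow(stacked_bind)
```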

Your Turn

Finally, let’s take a quick look at both the head() and the tail() of this new ss_tweets data frame to make sure it contains both “ngss” and “ccss” standards:

head(ss_tweets)
tail(ss_tweets)

Wow, so much for a family friendly learning lab! Based on this very limited sample, which set of standards do you think Twitter users are more negative about?

  • your response here

Let’s take a slightly larger sample of the CCSS tweets:

ss_tweets %>% 
  filter(standards == "ccss") %>%
  sample_n(20) %>%
  relocate(text)

Your Turn

Use the code chunk below to take a sample of the NGSS tweets:

# your code here

Still of the same opinion?

  • Respond here…

2c. Tidy Text

Text data by its very nature is ESPECIALLY untidy and is sometimes referred to as “unstructured” data. In this section we are introduced to the tidytext package and will learn some new functions to convert text to and from tidy formats. Having our text in a tidy format will allow us to switch seamlessly between tidy tools and existing text mining packages, while also making it easier to visualize text summaries in other data analysis tools like Tableau.

Tokenize Text

In Chapter 1 of Text Mining with R, Silge & Robinson (2017) define the tidy text format as a table with one-token-per-row, and explain that:

A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.

This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.

For this part of our workflow, our goal is to transform our ss_tweets data from this:

head(relocate(ss_tweets, text))

Into a “tidy text” one-token-per-row format that looks like this:

tidy_tweets <- ss_tweets %>% 
  unnest_tokens(output = word, 
                input = text) %>%
  relocate(word)

head(tidy_tweets)

Later in the year, we’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.

Unigrams

As demonstrated above, the tidytext package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert them to a one-token-per-row format.

Let’s tokenize our tweets by using this function to split each tweet into a single row to make it easier to analyze and take a look:

ss_tokens <- unnest_tokens(ss_tweets, 
                             output = word, 
                             input = text)

head(relocate(ss_tokens, word))

There is A LOT to unpack with this function:

  • First notice that unnest_tokens() expects a data frame as the first argument, followed by two column names.
  • The next argument is an output column name that doesn’t currently exist but will be created as the text is “unnested” into it (word in this case).
  • This is followed by the input column that the text comes from, which we uncreatively named text.
  • By default, a token is an individual word or unigram.
  • Other columns, such as author_id and created_at, are retained.
  • All punctuation has been removed.
  • Tokens have been changed to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn off if desired).
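The behaviors listed above are easy to verify on a single made-up tweet. The following sketch tokenizes one invented sentence into unigrams and, for comparison, into bigrams using the token = "ngrams" option; the `toy` data frame and its contents are assumptions for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# One invented tweet
toy <- tibble(author_id = "42", text = "Common Core math is HARD!")

# Default tokenizer: one lowercase word per row, punctuation removed
unigrams <- toy %>%
  unnest_tokens(output = word, input = text)

# Two-word phrases instead of single words
bigrams <- toy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

unigrams
bigrams
```

Notice that author_id is repeated on every row of both results, which is what lets us later count or compare tokens by author, date, or set of standards.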

Note: Since {tidytext} follows tidy data principles, we also could have used the %>% operator to pass our data frame to the unnest_tokens() function like so:

ss_tokens <- ss_tweets %>%
  unnest_tokens(output = word, 
                input = text)

Your Turn ⤵

The unnest_tokens() function also has a specialized “tweets” tokenizer in the token = argument that is very useful for dealing with Twitter text. It retains hashtags and mentions of usernames with the @ symbol, as illustrated by our @catturd2 friend who featured prominently in the first CCSS tweet.

Rewrite the code above (you can check answer below) to include the token argument set to “tweets,” assign to ss_tokens_1, and answer the questions that follow:

# your code here

  1. How many observations were in our original ss_tweets data frame?

  2. How many observations are there now? Why the difference?

Answer

Your code should look something like this:

ss_tokens_1 <- unnest_tokens(ss_tweets, 
                              output = word, 
                              input = text, 
                              token = "tweets")
Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
head(ss_tokens_1)

Before we move any further, let’s take a quick look at the most common words in our two datasets:

ss_tokens_1 %>%
  count(word, sort = TRUE)

Well, many of these tweets are clearly about the ccss and math at least, but beyond that it’s a bit hard to tell because there are so many “stop words” like “the,” “to,” “and,” “in” that don’t carry much meaning by themselves.

Remove Stop Words

Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the {tidytext} package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.

Let’s take a closer look at the lexicons and stop words included in each:

View(stop_words)

The anti_join Function

In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specific column from two datasets and returns only the rows from the first dataset that have no match in the second.

For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119.
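Here is a minimal sketch of an anti join on two invented data frames. The `toy_stop_words` list is a made-up stand-in for the much larger stop_words dataset; only the rows of `tokens` whose word has no match in it are returned.

```r
library(dplyr)

# Invented tokens and a tiny stand-in stop word list
tokens <- data.frame(word = c("the", "standards", "are", "confusing"))
toy_stop_words <- data.frame(word = c("the", "are", "to"),
                             lexicon = "toy")

# Keep rows of `tokens` with no matching word in `toy_stop_words`
kept <- anti_join(tokens, toy_stop_words, by = "word")

kept
```

Only the columns of the first data frame come back, so the lexicon column from the stop word list never appears in the result.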

Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.

ss_tokens_2 <- anti_join(ss_tokens_1,
                           stop_words,
                           by = "word")

head(ss_tokens_2)

Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the ss_tokens_1 dataset that match the stop_words dataset. Remember that when we first tokenized our dataset, I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. However, specifying by = wasn’t strictly necessary, since word is the only matching column name in both datasets and anti_join() would have matched on that column by default.

Your Turn ⤵

Use the code chunk below to take a quick count of the most common tokens in our ss_tokens_2 data frame to see if the results are a little more meaningful:

ss_tokens_2 %>%
  count(word, sort = TRUE)

Custom Stop Words

Notice that the nonsense word “amp” is among our high frequency words, as well as some stray symbols. We can create our own custom stop word list to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.

Let’s create a custom stop word list by using the simple c() function to combine our words. We can then add a filter to keep only rows where the value in our word column is NOT (!) found in (%in%) our my_stopwords list:

my_stopwords <- c("amp", "=", "+")

ss_tokens_3 <-
  ss_tokens_2 %>%
  filter(!word %in% my_stopwords)
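If the `!word %in% my_stopwords` expression looks cryptic, it can help to try it on a plain vector first. In the base R sketch below, with an invented `words` vector, R evaluates `%in%` before `!`, so the expression reads as “NOT (word is in the list)”.

```r
# Invented tokens, plus the same custom stop word list
words <- c("math", "amp", "students", "+")
my_stopwords <- c("amp", "=", "+")

is_stopword <- words %in% my_stopwords    # TRUE where a word is in the list
keep_word   <- !words %in% my_stopwords   # parsed as !(words %in% my_stopwords)

kept_words <- words[keep_word]
kept_words
```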

Let’s take a look at our top words again and see if that did the trick:

ss_tokens_3 %>%
  count(word, sort = TRUE)

Much better! Note that we could extend this stop word list indefinitely. Feel free to use the code chunk below to try adding more words to our stop list.

Before we move any further, let’s save our tidied tweets as a new data frame for Section 3 and also save it as a .csv file in our data folder:

ss_tidy_tweets <- ss_tokens_3

write_csv(ss_tokens_3, "data/ss_tidy_tweets.csv")

3. EXPLORE

Calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. In Section 3, we keep things relatively simple and focus on some simple data summaries:

  1. Word Counts. We focus primarily on the use of word counts and briefly introduce word frequencies to help us identify words commonly used in tweets about the NGSS and CCSS curriculum standards.

  2. Word Frequencies. We wrap up this lab and preview some data visualization work in later labs by creating a simple wordcloud to summarize and highlight key words among our tweets.

3a. Word Counts

As highlighted in Word Counts are Amazing, an excellent blog post by Ted Underwood at the University of Illinois, one simple but powerful approach to text analysis is counting the frequency with which words occur in a given collection of documents, or corpus.

Word counts are a good example of a simple approach that illustrates the central question to text mining and natural language processing, introduced at the beginning:

How do we quantify what a document or collection of documents is about?

So far, we’ve used the count() function from the {dplyr} package to look at word counts across our entire corpus of tweets.

Let’s use the same function to look at counts of the most common words by standards this time since one of our goals is to compare public sentiment between the two standards:

ss_tidy_tweets %>%
  count(standards, word, sort = TRUE)

Note that we included standards in our function to count how often each word occurs for each set of standards. For example, if you tab through the output, you will see that “students” is among the top words in both sets of standards, and occurs 1,432 times in the NGSS tweets and 1,127 times in the CCSS tweets.

Unsurprisingly words from our Twitter API search query are among the top words in each set of standards as well.
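To see exactly what adding standards to count() changes, here is a small sketch on five invented tokens. Counting by word alone pools both sets of standards, while counting by standards and word keeps a separate tally for each pairing.

```r
library(dplyr)

# Invented tokens for illustration
toy_tokens <- data.frame(
  standards = c("ccss", "ccss", "ccss", "ngss", "ngss"),
  word      = c("math", "math", "students", "science", "students")
)

# One row per word, pooled across both sets of standards
overall_counts <- toy_tokens %>%
  count(word, sort = TRUE)

# One row per standards-word pairing
by_standards <- toy_tokens %>%
  count(standards, word, sort = TRUE)

overall_counts
by_standards
```

In the pooled version “students” has a count of 2, whereas the grouped version shows it occurring once in each set of standards.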

It’s a little difficult to directly compare the top words in each set since they are lumped together. Let’s use our filter() function again to just look at the CCSS tweets and save this for later to use in our Reach activity:

ccss_counts <- ss_tidy_tweets %>%
  filter(standards == "ccss") %>%
  count(word, sort = TRUE)

ccss_counts

Your Turn ⤵

Now use the code below to get the counts for our NGSS tweets so we can compare the top words for each set of standards:

# your code here

What might the top words for each set of standards suggest about similarities and differences for how Twitter users talk about each? What might it suggest about public sentiment?

  • your response here

3b. Word Frequencies

We saw above that the word “students” is among the top words in both sets of standards, but to help facilitate comparisons, it is often helpful to look at the frequency with which each word occurs among all words for that document group. This will also help us to better gauge how prominent each word is for each set of standards.

For example, let’s create counts for each standards and word pairing like we did above, and then create a new column using the mutate() function that calculates the proportion that word makes up among all words:

ccss_frequencies <- ccss_counts %>%
  mutate(proportion = n / sum(n))

ccss_frequencies
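As an alternative sketch, proportions for both sets of standards can be computed in one step by grouping before mutate(), so that sum(n) is taken within each group rather than over the whole table. The counts in `toy_counts` are invented for illustration.

```r
library(dplyr)

# Invented word counts for two sets of standards
toy_counts <- data.frame(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  word      = c("math", "students", "science", "students"),
  n         = c(6, 2, 3, 1)
)

toy_frequencies <- toy_counts %>%
  group_by(standards) %>%                 # sum(n) is now computed per group
  mutate(proportion = n / sum(n)) %>%
  ungroup()

toy_frequencies
```

Within each set of standards the proportions add up to 1, which makes words directly comparable even when one group has far more tweets than the other.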

Your Turn ⤵

Now use the code below to get the frequencies for our NGSS tweets so we can compare the top words for each set of standards:

# your code here

We can see in both cases that our search terms are heavily skewing our proportions. What might we do to address this?

  • your response here

4. MODEL

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful. In TM Learning Lab 3, we’ll take a closer look at the study by Rosenberg et al. (2020) and how they used modeling to compare differences in sentiment between teachers and non-teachers discussing the NGSS.

5. COMMUNICATE

Congratulations - you’ve completed the first text mining learning lab! To complete your work, you can click the drop down arrow at the top of the file, then select “Knit to HTML.” This will create a report in your Files pane that serves as a record of your code and its output that you can open or share.

If you wanted, you could save the processed data set to your data folder. The write_csv() function is useful for this. The following code is set to not run, as we wanted to ensure that everyone had the data set needed to begin the second learning lab, but if you’re confident in your prepared data, you can save it with the following:

write_csv(ss_tidy_tweets, "data/ss_tidy_tweets.csv")

Reach (Optional)

If you’re using data that you brought to the institute or data that you pulled from Twitter, try tidying your data into a tidy text format and examining the top words in your dataset.

If you’d like to use the data we’ve been working with for your reach, let’s try some basic data visualization with text. The wordcloud2 package is a dead simple tool for generating HTML-based word clouds.

For example, let’s load the wordcloud2 library, and run the wordcloud2() function on our ccss_counts data frame:

library(wordcloud2)

wordcloud2(ccss_counts)

As you can see, “math” is a pretty common topic when discussing the common core on Twitter, but words like “core” and “common” are not very helpful since those were in our search terms when pulling data from Twitter.

In a separate R script file, try modifying our list of stop words, retidying our text, and doing a new word count with these and perhaps other words removed. Also, take a look at the help file for wordcloud2 to see if there are other ways you could improve this visualization.

Word clouds are much maligned and sometimes referred to as pie charts for words, but they can be useful for quickly summarizing and communicating qualitative data for education practitioners and are intuitive for them to interpret. Also, for better or worse, these are now included as a default viz for open-ended survey items in online Qualtrics reports.

Reflection

In this learning lab, we focused on the literature guiding our analysis; wrangling our data into a one-token-per-row tidy text format; and using simple word counts and frequencies to compare common words used in tweets about the NGSS and CCSS curriculum standards. Below, add a few notes in response to the following prompts:

  1. One thing I took away from this learning lab:

  2. One thing I want to learn more about:

References

Note: Citations embedded in R Markdown will only show upon knitting.

Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school. Routledge. https://doi.org/10.4324/9781315650722
Rosenberg, J., Borchers, C., Dyer, E. B., Anderson, D., & Fischer, C. (2020). Understanding public sentiment about educational reforms: The Next Generation Science Standards on Twitter. http://dx.doi.org/10.31219/osf.io/xymsd
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media. https://www.tidytextmining.com
Wang, Y., & Fikis, D. J. (2017). Common Core State Standards on Twitter: Public Sentiment and Opinion Leaders. Educational Policy, 33(4), 650–683. https://doi.org/10.1177/0895904817723739
---
title: 'Tidy Text, Tokens, and Twitter'
subtitle: 'Text Mining Learning Lab 1'
author: "YOUR NAME HERE"
date: "`r format(Sys.Date(),'%B %e, %Y')`"
output:
  html_document:
    toc: true
    toc_depth: 5
    toc_float: yes
  html_notebook:
bibliography: lit/references.bib
csl: lit/apa.csl
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 1. PREPARE

The transition to digital learning has made available new sources of data, providing researchers new opportunities for understanding and improving STEM learning. Data sources such as digital learning environments and administrative data systems, as well as data produced by social media websites and the mass digitization of academic and practitioner publications, hold enormous potential to address a range of pressing problems in STEM Education, but collecting and analyzing text-based data also presents unique challenges.

**Text Mining (TM) Module 1: Public Sentiment and the State Standards** will help demonstrate how text mining can be applied in STEM education research and provide LASER Institute scholars hands-on experience with popular techniques for collecting, processing, and analyzing text-based data. Specifically, the four learning labs that make up this module address the following topics:

-   **Learning Lab 1: Tidy Text, Tokens, & Twitter.** We take a closer look at the literature guiding our analysis; wrangle our data into a one-token-per-row tidy text format; and use simple word counts to explore our tweets about the Common Core and Next Generation Science Standards.

-   **Learning Lab 2: Twice the fun with Bigrams.** For our second lab, we explore our unigrams, or single-word tokens, a little more, and also see what pairs of words and word correlations tell us about our tweets and what insight they provide in response to our research questions.

-   **Learning Lab 3: Come to the Dark Side.** We focus on the use of lexicons in our third lab and introduce the {vader} package to compare the sentiment of tweets about the NGSS and CCSS state standards in order to better understand public reaction to these two curriculum reform efforts. 

-   **Learning Lab 4: A Tale of Two Standards.** We wrap up our look at public sentiment around STEM state curriculum standards by selecting an analysis that provides some unique insight; refining and polishing a data product; and writing a brief narrative to communicate findings in response to our research questions.

### 1a. Review the Literature

Text Mining Module 1 is guided by a recent publication by @rosenberg2020, *Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards*. This study in turn builds upon previous work by @wang2017 examining public opinion on the Common Core State Standards (CCSS) on Twitter. For Module 1, we will focus on analyzing tweets about the [Next Generation Science Standards](https://www.nextgenscience.org) (NGSS) and [Common Core State Standards](http://www.corestandards.org) (CCSS) in order to better understand key words and phrases that emerge, as well as public sentiment towards these two curriculum reform efforts.

#### Twitter and the Next Generation Science Standards

![](img/rosenberg.png){width="30%"}

[Full Paper (Preprint)](https://osf.io/xymsd/)

##### **Abstract**

While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them. To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users. Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive -- 33 times more so than for the CCSS. Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.

##### **Data Sources**

Similar to what we'll be learning in this lab, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the `rtweet` package in R. Specifically, the authors accessed tweets and user information from the hashtag-based \#NGSSchat online community, as well as all tweets that included any of the following phrases, with "/" indicating an additional phrase featuring the respective plural form: "ngss", "next generation science standard/s", "next gen science standard/s".
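To make the phrase list above a little more concrete, here is a minimal sketch of how such phrases could be matched with a regular expression. The example tweets and the `ngss_pattern` name are made up for illustration, and this is not the authors' retrieval code, which queried the Twitter API rather than filtering text locally.

```r
library(stringr)

# Hypothetical tweets, invented for this illustration
example_tweets <- c("Loving the new NGSS units",
                    "Next Gen Science Standard alignment tips",
                    "Common Core math is hard")

# "s?" makes the plural optional; "(eration)?" covers "gen" and "generation"
ngss_pattern <- regex("ngss|next gen(eration)? science standards?",
                      ignore_case = TRUE)

# TRUE for tweets that mention any of the NGSS phrases
mentions_ngss <- str_detect(example_tweets, ngss_pattern)
mentions_ngss
```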

Data used in this lab was pulled using an [Academic Research developer account](https://developer.twitter.com/en/products/twitter-api/academic-research) and the {academictwitteR} package, which uses the Twitter API v2 endpoints and allows researchers to access the full Twitter archive. For those who created a standard developer account, the [rtweet & the Twitter API](https://laser-institute.github.io/text-mining/tm-learning-lab-1a.html) supplemental learning lab will show you how to pull your own data from Twitter.

Data for this lab includes all tweets from 2020 that included the following terms: `#ccss`, `common core`, `#ngsschat`, `ngss`. Below is an example of the code used to retrieve data for this lab. It is set not to run and will not run if you try, but it does illustrate the search query used, variables selected, and time frame.

```{r, eval=FALSE}
ccss_tweets_2021 <-
  get_all_tweets('(#commoncore OR "common core") -is:retweet lang:en',
                 "2021-01-01T00:00:00Z",
                 "2021-05-31T00:00:00Z",
                 bearer_token,
                 data_path = "ccss-data/",
                 bind_tweets = FALSE)

ccss_tweets <- bind_tweet_jsons(data_path = "ccss-data/") %>%
  select(text,
         created_at,
         author_id,
         id,
         conversation_id,
         source,
         possibly_sensitive,
         in_reply_to_user_id)


write_csv(ccss_tweets, "data/ccss-tweets.csv")
```

##### **Analysis**

Also similar to what we'll demonstrate in Lab 3, the authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2017) used it to explore the sentiment of CCSS-related posts. We'll be using the AFINN sentiment lexicon, which instead assigns each word a single integer score from -5 (most negative) to +5 (most positive), in addition to exploring some other sentiment lexicons to see if they produce similar results.

The authors also used the `lme4` package in R to run a mixed effects model to determine if sentiment changes over time and differs between teachers and non-teachers. We won't look at the relationships between tweet sentiment, time, and teachers in these labs, but we will take a look at the correlation between words within tweets in TM Learning Lab 2.

**Summary of Key Findings**

1.  Contrasting with sentiment about CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
2.  Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
3.  Differences between the context of the tweets were small, but those that did not include the \#NGSSchat hashtag became more positive over time than those posts that did include the hashtag.
4.  When individuals posted more tweets during \#NGSSchat chats, the sentiment of their posts was more positive, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in the \#NGSSchat was positive.

Finally, you can watch Dr. Rosenberg provide a quick 3-minute overview of this work at <https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x>.

### 1b. Define Questions

One overarching question that Silge and Robinson (2017) identify as central to text mining and natural language processing, and that we'll explore throughout the text mining labs this year, is:

> How do we **quantify** what a document or collection of documents is about?

The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:

1.  What is the public sentiment expressed toward the NGSS?
2.  How does sentiment for teachers differ from non-teachers?
3.  How do tweets posted to \#NGSSchat differ from those without the hashtag?
4.  How does participation in \#NGSSchat relate to the public sentiment individuals express?
5.  How does public sentiment vary over time?

For our first lab on text mining in STEM education, we'll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in the next four learning labs we'll attempt to answer the following questions:

1.  What are the most frequent words or phrases used in tweets about the CCSS and NGSS?
2.  What words and hashtags commonly occur together?
3.  How does sentiment for NGSS compare to sentiment for CCSS?

### 1c. Load Libraries

#### tidyverse 📦

![](img/tidyverse.png){width="20%"}

As noted in our Getting Started activity, R uses "packages," or add-ons, that enhance its functionality. One package that we'll be using extensively is {tidyverse}. The {tidyverse} package is actually a [collection of R packages](https://www.tidyverse.org/packages) designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes referred to as "tidy data principles."

Click the green arrow in the right corner of the "code chunk" that follows to load the {tidyverse} library.

```{r}
library(tidyverse)
```

Again, don't worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded have the same name as functions in other packages, and R will default to the function from the last loaded package unless you specify otherwise.
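If you ever need to be explicit about which package a function comes from, you can prefix it with the package name and `::`. The sketch below uses a small made-up data frame (`toy` is a hypothetical name, not part of the lab data) to show that `dplyr::filter()` and `stats::filter()` are different functions that happen to share a name.

```r
library(dplyr)

# A tiny made-up data frame, used only for illustration
toy <- tibble(standards = c("ccss", "ngss", "ccss"),
              n = c(3, 5, 2))

# dplyr::filter() keeps the rows that meet a condition...
ccss_only <- dplyr::filter(toy, standards == "ccss")

# ...while stats::filter() is an unrelated function that applies
# a linear filter (here a two-point moving average) to a numeric series
moving_avg <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))

ccss_only
```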

#### tidytext 📦

![](img/tidytext.png){width="20%"}

As we'll learn first hand in this module, using tidy data principles can also make many text mining tasks easier, more effective, and consistent with tools already in wide use. The {tidytext} package helps to convert text into data frames of individual words, making it easy to manipulate, summarize, and visualize text using familiar functions from the {tidyverse} collection of packages.

Let's go ahead and load the {tidytext} package:

```{r}
library(tidytext)
```

For a more comprehensive introduction to the `tidytext` package, we cannot recommend enough the free online book, *Text Mining with R: A Tidy Approach* [@silge2017text]. If you're interested in pursuing text analysis using R after the Summer Workshop, this will be a go-to reference.

## 2. WRANGLE

The importance of data wrangling, particularly when working with text, is difficult to overstate. Just as a refresher, wrangling involves the initial steps of going from raw data to a dataset that can be explored and modeled [@krumm2018]. This section has a heavy emphasis on preparing text for analysis, and in particular we'll learn how to:

a.  **Import Data**. First we revisit the familiar `read_csv()` function for reading in our CCSS and NGSS tweets.
b.  **Restructure Data**. We focus on removing extraneous data using the `select()` and `filter()` functions from {dplyr}, and introduce two new functions for merging the data frames that we imported.
c.  **Tidy Text.** We introduce the {tidytext} package to "tidy" and tokenize our tweets in order to create our data frame for analysis, and revisit the concept of joins to remove "stop words" that don't add much value to our analysis.

### 2a. Import and View Data

```{r}
ccss_tweets <- read_csv("data/ccss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )
```

Note the addition of the `col_types =` argument for changing some of the column types to character strings because the numbers for those particular columns actually indicate identifiers for authors and tweets:

-   `author_id` = the author of the tweet

-   `id` = the unique id for each tweet

-   `conversation_id` = the unique id for each [conversation thread](https://developer.twitter.com/en/docs/twitter-api/conversation-id)

-   `in_reply_to_user_id` = the author of the tweet being replied to
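To see why these columns are better stored as text, here is a minimal sketch using two made-up tweet ids that differ only in their last digit. It assumes readr 2.0 or later, where literal data can be passed to `read_csv()` by wrapping it in `I()`; the object names are hypothetical.

```r
library(readr)
library(dplyr)

# Two made-up tweet ids that differ only in the last digit
csv_text <- "id,text\n1212092628029698049,first tweet\n1212092628029698050,second tweet\n"

# Letting readr guess: the ids are read as numbers (doubles)
guessed <- read_csv(I(csv_text), show_col_types = FALSE)

# Telling readr the ids are identifiers, not quantities
as_text <- read_csv(I(csv_text),
                    col_types = cols(id = col_character()))

# Ids this large cannot be stored exactly as doubles,
# so distinct ids can collapse into the same number
n_distinct(guessed$id)

# Character ids are kept exactly as written
n_distinct(as_text$id)
```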

#### [Your Turn]{style="color: green;"} ⤵

**RStudio Tip:** Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an "Import Dataset" feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.

![](img/import-data.png)

Try using the "Import Dataset" feature in the upper right environment pane to import the NGSS tweets located in the data folder.

The code generated should look something like this:

```{r}
ngss_tweets <- read_csv("data/ngss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )
```

Use the following code chunk to inspect your tweets using a function you've learned so far for viewing your data:

```{r}
# your code here
```

### 2b. Restructure Data

#### Subset Tweets

As you may have noticed, we have more data than we need for our analysis and should probably pare it down to just what we'll use.

First, since this is a family-friendly learning lab, let's use the `filter()` function introduced in previous labs to filter out rows containing "possibly sensitive" language:

```{r, eval=TRUE}
ccss_tweets_1 <- ccss_tweets %>% 
  filter(possibly_sensitive == "FALSE")
```
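As a small aside, here is a sketch of what this kind of filter does, using a made-up data frame (`toy_tweets` is a hypothetical name). For a logical column, comparing against `FALSE` and negating the column with `!` keep the same rows.

```r
library(dplyr)

# Made-up tweets, for illustration only
toy_tweets <- tibble(
  text = c("Great NGSS lesson today", "Some rude tweet", "Common Core math help"),
  possibly_sensitive = c(FALSE, TRUE, FALSE)
)

# Two equivalent ways to keep rows that are NOT flagged as sensitive
kept_a <- toy_tweets %>% filter(possibly_sensitive == FALSE)
kept_b <- toy_tweets %>% filter(!possibly_sensitive)

# Number of rows dropped by the filter
nrow(toy_tweets) - nrow(kept_a)
```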

Now let's use the `select()` function to select the following columns from our new `ccss_tweets_1` data frame:

1.  `text` containing the tweet which is our primary data source of interest
2.  `author_id` of the user who created the tweet
3.  `created_at` timestamp for examining changes in sentiment over time
4.  `conversation_id` for examining sentiment by conversations
5.  `id` for the unique reference id for each tweet and useful for counts

```{r select-variables, eval=TRUE}
ccss_tweets_2 <- ccss_tweets_1 %>% 
  select(text,
         author_id,
         created_at, 
         conversation_id,
         id)
```

#### [Your Turn]{style="color: green;"} ⤵

**Note:** The `select()` function will also reorder your columns based on the order in which you list them.

Use the code chunk below to reorder the columns to your liking and assign to `ccss_tweets_3`:

```{r}
# your code here
```

#### Add & Relocate Columns

Finally, since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column to quickly identify the set of state standards with which each tweet is associated.

We'll use the `mutate()` function to create a new variable called `standards` to label each tweet as "ccss":

```{r}
ccss_tweets_4 <- mutate(ccss_tweets_2, standards = "ccss")

colnames(ccss_tweets_4)
```

And just because it bothers me, I'm going to use the `relocate()` function to move the `standards` column to the first position so I can quickly see which standards the tweet is from:

```{r}
ccss_tweets_5 <- relocate(ccss_tweets_4, standards)

colnames(ccss_tweets_5)
```

Again, we could also have used the `select()` function to reorder columns like so:

```{r}
ccss_tweets_5 <- ccss_tweets_4 %>% 
  select(standards,
         text,
         author_id,
         created_at, 
         conversation_id,
         id)

colnames(ccss_tweets_5)
```

Before moving on to the NGSS standards, let's use the `%>%` operator to rewrite the code from our wrangling so there is less redundancy and it is easier to read:

```{r}
# Wrangle CCSS tweets in a single pipeline
ccss_tweets_clean <- ccss_tweets %>%
  filter(possibly_sensitive == "FALSE") %>%
  select(text, author_id, created_at, conversation_id, id) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

head(ccss_tweets_clean)
```

#### [Your Turn]{style="color: green;"} ⤵

Recall from section [1b. Define Questions] that we are interested in comparing word usage and public sentiment around both the Common Core and Next Gen Science Standards.

Create a new `ngss_tweets_clean` data frame consisting of the Next Generation Science Standards tweets we imported, using the code above as a guide.

```{r}
# your code here
```

Try not to peek at the answer below unless you are having difficulty with your code.

#### Answer

```{r 2b-answer}
ngss_tweets_clean <- ngss_tweets %>%
  filter(possibly_sensitive == "FALSE") %>%
  select(text, author_id, created_at, conversation_id, id) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

head(ngss_tweets_clean)
```

#### Merge Data Frames

Finally, let's combine our CCSS and NGSS tweets into a single data frame by using the `union()` function from `dplyr` and simply supplying the data frames that you want to combine as arguments:

```{r}
ss_tweets <- union(ccss_tweets_clean,
                   ngss_tweets_clean)
```

Note that when creating a "union" like this (i.e. stacking one data frame on top of another), you should have the same number of columns in each data frame and they should be in the exact same order.
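One detail worth knowing is that `union()` also drops duplicate rows, whereas `bind_rows()` from {dplyr} keeps every row. The sketch below illustrates the difference with two tiny made-up data frames (`ccss_toy` and `ngss_toy` are hypothetical names); which behavior you want depends on whether repeated tweets should count once or several times.

```r
library(dplyr)

# Two tiny made-up data frames with the same columns
ccss_toy <- tibble(standards = "ccss", text = c("tweet a", "tweet b"))
ngss_toy <- tibble(standards = "ngss", text = c("tweet c", "tweet c"))

# union() stacks the rows but keeps only one copy of duplicated rows
stacked_unique <- union(ccss_toy, ngss_toy)

# bind_rows() stacks the rows and keeps every row, duplicates included
stacked_all <- bind_rows(ccss_toy, ngss_toy)

nrow(stacked_unique)
nrow(stacked_all)
```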

#### [Your Turn]{style="color: green;"} ⤵

Finally, let's take a quick look at both the `head()` and the `tail()` of this new `ss_tweets` data frame to make sure it contains both "ngss" and "ccss" standards:

```{r}
head(ss_tweets)
tail(ss_tweets)
```

Wow, so much for a family friendly learning lab! Based on this very limited sample, which set of standards do you think Twitter users are more negative about?

-   your response here

Let's take a slightly larger sample of the CCSS tweets:

```{r}
ss_tweets %>% 
  filter(standards == "ccss") %>%
  sample_n(20) %>%
  relocate(text)
           
```

#### [Your Turn]{style="color: green;"} ⤵

Use the code chunk below to take a sample of the NGSS tweets:

```{r}
# your code here
```

Still of the same opinion?

-   Respond here...

### 2c. Tidy Text

Text data by its very nature is ESPECIALLY untidy and is sometimes referred to as "unstructured" data. In this section we are introduced to the `tidytext` package and will learn some new functions to convert text to and from tidy formats. Having our text in a tidy format will allow us to switch seamlessly between tidy tools and existing text mining packages, while also making it easier to visualize text summaries in other data analysis tools like Tableau.

#### Tokenize Text {data-link="2b. Tidy Text"}

In Chapter 1 of Text Mining with R, @silge2017text define the tidy text format as a table with one-token-per-row, and explain that:

> A **token** is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.

This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.

For this part of our workflow, our goal is to transform our `ss_tweets` data from this:

```{r}
head(relocate(ss_tweets, text))
```

Into a "tidy text" one-token-per-row format that looks like this:

```{r}
tidy_tweets <- ss_tweets %>% 
  unnest_tokens(output = word, 
                input = text) %>%
  relocate(word)

head(tidy_tweets)
```

Later in the year, we'll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.
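To make the idea of different token types a little more concrete, here is a minimal sketch that tokenizes a single made-up tweet first into words and then into bigrams. The `toy` data frame is hypothetical, and the `token = "ngrams"` option is only one of several that `unnest_tokens()` accepts.

```r
library(dplyr)
library(tidytext)

# One made-up tweet, used only for illustration
toy <- tibble(id = 1, text = "Students love doing real science")

# Default token: one word per row
words <- toy %>%
  unnest_tokens(output = word, input = text)

# Bigrams: overlapping two-word phrases, one per row
bigrams <- toy %>%
  unnest_tokens(output = bigram, input = text,
                token = "ngrams", n = 2)

words$word
bigrams$bigram
```

A five-word tweet yields five word tokens but only four bigrams, since each bigram overlaps its neighbor by one word.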

#### Unigrams

As demonstrated above, the `tidytext` package provides the incredibly powerful `unnest_tokens()` function to tokenize text (including tweets!) and convert them to a one-token-per-row format.

Let's tokenize our tweets by using this function to split each tweet into one word per row, which makes it easier to analyze, and take a look:

```{r unnest-tokens}
ss_tokens <- unnest_tokens(ss_tweets, 
                             output = word, 
                             input = text)

head(relocate(ss_tokens, word))
```

There is A LOT to unpack with this function:

-   First notice that `unnest_tokens()` expects a data frame as the first argument, followed by two column names.
-   The next argument is an output column name that doesn't currently exist but will be created as the text is "unnested" into it (`word` in this case).
-   This is followed by the input column that the text comes from, which we uncreatively named `text`.
-   By default, a token is an individual word or unigram.
-   Other columns, such as `author_id` and `created_at`, are retained.
-   All punctuation has been removed.
-   Tokens have been changed to lowercase, which makes them easier to compare or combine with other datasets (use the `to_lower = FALSE` argument to turn off if desired).
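
Several of the points above can be checked on a single made-up tweet. This is only a sketch with a hypothetical `toy` data frame, comparing the default behavior with `to_lower = FALSE`.

```r
library(dplyr)
library(tidytext)

# One made-up tweet with mixed case and punctuation
toy <- tibble(author_id = "42", text = "NGSS is GREAT, really!")

# Default behavior: lowercase tokens, punctuation stripped,
# and other columns such as author_id retained
lower_tokens <- unnest_tokens(toy, output = word, input = text)

# Keep the original capitalization instead
cased_tokens <- unnest_tokens(toy, output = word, input = text,
                              to_lower = FALSE)

lower_tokens
cased_tokens
```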

**Note:** Since {tidytext} follows tidy data principles, we also could have used the `%>%` operator to pass our data frame to the `unnest_tokens()` function like so:

```{r}
ss_tokens <- ss_tweets %>%
  unnest_tokens(output = word, 
                input = text)
```

#### [Your Turn]{style="color: green;"} ⤵

The `unnest_tokens()` function also has a specialized `"tweets"` tokenizer, selected with the `token =` argument, that is very useful for dealing with Twitter text. It retains hashtags and mentions of usernames with the \@ symbol, as illustrated by our \@catturd2 friend who featured prominently in the first CCSS tweet.

Rewrite the code above (**you can check answer below**) to include the token argument set to "tweets", assign to `ss_tokens_1`, and answer the questions that follow:

```{r}
# your code here
```

1.  How many observations were in our original `ss_tweets` data frame?

    -   

2.  How many observations are there now? Why the difference?

    -   

#### Answer

Your code should look something like this:

```{r}
ss_tokens_1 <- unnest_tokens(ss_tweets, 
                              output = word, 
                              input = text, 
                              token = "tweets")

head(ss_tokens_1)
```

Before we move any further, let's take a quick look at the most common words in our two datasets:

```{r}
ss_tokens_1 %>%
  count(word, sort = TRUE)

```

Well, many of these tweets are clearly about the ccss and math at least, but beyond that it's a bit hard to tell because there are so many "stop words" like "the", "to", "and", "in" that don't carry much meaning by themselves.

#### Remove Stop Words

Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The `stop_words` dataset in the {tidytext} package contains stop words from three lexicons. We can use them all together, as we have here, or `filter()` to only use one set of stop words if that is more appropriate for a certain analysis.

Let's take a closer look at the lexicons and stop words included in each:

```{r, eval=FALSE}
View(stop_words)
```
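
If you prefer a summary over scrolling through the viewer, the sketch below counts the stop words contributed by each lexicon and then keeps just one of them. The choice of the "snowball" lexicon and the `snowball_stops` name are only for illustration.

```r
library(dplyr)
library(tidytext)

# How many stop words does each lexicon contribute?
stop_words %>%
  count(lexicon)

# Keep the stop words from a single lexicon only
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")

head(snowball_stops)
```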

#### The `anti_join` Function

In order to remove these stop words, we will use a function called `anti_join()` that looks for matching values in a specific column from two datasets and returns rows from the original dataset that have no matches like so:

![](img/anti-join.png)

For a good overview of the different `dplyr` joins see here: <https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119>.
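
As a quick illustration of the idea, here is a minimal sketch of `anti_join()` on two tiny made-up data frames (`tokens` and `stops` are hypothetical names, not the lab data).

```r
library(dplyr)

# Made-up tokens and a made-up stop word list
tokens <- tibble(word = c("the", "standards", "are", "rigorous"))
stops  <- tibble(word = c("the", "are", "of"))

# Keep only the rows of `tokens` whose word has no match in `stops`
kept <- anti_join(tokens, stops, by = "word")

kept$word
```

Rows are returned in their original order, and stop words that never appear in `tokens` (like "of" here) simply have no effect.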

Now let's remove stop words that don't help us learn much about what people are saying about the state standards.

```{r stop-unigrams}
ss_tokens_2 <- anti_join(ss_tokens_1,
                           stop_words,
                           by = "word")

head(ss_tokens_2)
```

Notice that we've specified the `by =` argument to look for matching words in the `word` column for both data sets and remove any rows from the `ss_tokens_1` dataset that match the `stop_words` dataset. Remember that when we first tokenized our dataset, I conveniently chose `output = word` as the column name because it matches the column name `word` in the `stop_words` dataset contained in the `tidytext` package. This makes our call to `anti_join()` simpler because `anti_join()` knows to look for the column named `word` in each dataset. However, specifying `by =` wasn't strictly necessary, since `word` is the only matching column name in both datasets and it would have matched those columns by default.

#### [Your Turn]{style="color: green;"} ⤵

Use the code chunk below to take a quick count of the most common tokens in our `ss_tokens_2` data frame to see if the results are a little more meaningful:

```{r}
ss_tokens_2 %>%
  count(word, sort = TRUE)
```

#### Custom Stop Words

Notice that the nonsense word "amp" (a leftover from `&amp;`, the HTML code for an ampersand) is among our high-frequency words, as well as some stray symbols such as "=" and "+". We can create our own custom stop word list to weed out any additional words that don't carry much meaning but skew our data by being so prominent.

Let's create a custom stop word list by using the simple `c()` function to combine our words. We can then add a filter to keep rows where words in our `word` column do NOT (`!`) match words `%in%` our `my_stopwords` list:

```{r}
my_stopwords <- c("amp", "=", "+")

ss_tokens_3 <-
  ss_tokens_2 %>%
  filter(!word %in% my_stopwords)
```
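
An alternative, sketched below under the assumption that you would rather do everything in one join, is to add your own words to the built-in `stop_words` data frame and then use `anti_join()` once. The "custom" lexicon label and the `toy_tokens` data are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Add our own words to the built-in list, labeled with a made-up lexicon name
custom_stop_words <- bind_rows(
  tibble(word = c("amp", "=", "+"), lexicon = "custom"),
  stop_words
)

# A tiny made-up set of tokens to try it on
toy_tokens <- tibble(word = c("amp", "the", "math", "+", "students"))

# One anti_join() now removes both standard and custom stop words
cleaned <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")

cleaned$word
```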

Let's take a look at our top words again and see if that did the trick:

```{r}
ss_tokens_3 %>%
  count(word, sort = TRUE)
```

Much better! Note that we could extend this stop word list indefinitely. Feel free to go back to the code chunk above and try adding more words to our stop list.

Before we move any further, let's save our tidied tweets as a new data frame for Section 3 and also save it as a .csv file in our data folder:

```{r}
ss_tidy_tweets <- ss_tokens_3

write_csv(ss_tokens_3, "data/ss_tidy_tweets.csv")
```

## 3. EXPLORE

Calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are key parts of exploratory data analysis. In Section 3, we keep things relatively simple and focus on some simple data summaries:

a.  **Word Counts**. We focus primarily on the use of word counts and briefly introduce word frequencies to help us identify words commonly used in tweets about the NGSS and CCSS curriculum standards.

b.  **Word Frequencies**. We wrap up this lab and preview some data visualization work in later labs by creating a simple word cloud to explore, summarize, and highlight key words among our tweets.

### 3a. Word Counts

As highlighted in [Word Counts are Amazing](https://tedunderwood.com/2013/02/20/wordcounts-are-amazing/), an excellent blog post by Ted Underwood at the University of Illinois, one simple but powerful approach to text analysis is counting how frequently words occur in a given collection of documents, or corpus.

Word counts are a good example of a simple approach that illustrates the central question to text mining and natural language processing, introduced at the beginning:

> How do we **quantify** what a document or collection of documents is about?

So far, we've used the `count()` function from the {dplyr} package to look at word counts across our entire corpus of tweets.

Let's use the same function to look at counts of the most common words by standards this time since one of our goals is to compare public sentiment between the two standards:

```{r}
ss_tidy_tweets %>%
  count(standards, word, sort = TRUE)
```

Note that we included `standards` in our function to count how often each word occurs for each set of standards. For example, if you tab through the output, you will see that "students" is among the top words in both sets of standards, and occurs 1,432 times in the NGSS tweets and 1,127 times in the CCSS tweets.
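
To see exactly what adding a second column to `count()` does, here is a minimal sketch on a handful of made-up tokens (`toy_tokens` is a hypothetical name).

```r
library(dplyr)

# A few made-up tokens labeled by set of standards
toy_tokens <- tibble(
  standards = c("ccss", "ccss", "ccss", "ngss", "ngss"),
  word      = c("math", "math", "students", "science", "students")
)

# One row per standards/word pair, most frequent first
toy_counts <- toy_tokens %>%
  count(standards, word, sort = TRUE)

toy_counts
```

Because "students" appears under both sets of standards, it gets two rows, one for each.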

Unsurprisingly, words from our Twitter API search query are among the top words in each set of standards as well.

It's a little difficult to directly compare the top words in each set since they are lumped together. Let's use our `filter()` function again to just look at the CCSS tweets and save this for later to use in our Reach activity:

```{r}
ccss_counts <- ss_tidy_tweets %>%
  filter(standards == "ccss") %>%
  count(word, sort = TRUE)

ccss_counts
```
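As an alternative to filtering one set of standards at a time, the top words for every group can be pulled in a single step. The sketch below is only an illustration: `toy_tweets` and its values are invented stand-ins for `ss_tidy_tweets`, and it assumes a version of {dplyr} that provides `slice_max()` (1.0.0 or later).

```{r}
library(dplyr)

# Invented toy data in the same one-token-per-row shape as ss_tidy_tweets
toy_tweets <- tibble(
  standards = rep(c("ccss", "ngss"), each = 6),
  word = c("math", "math", "math", "students", "students", "core",
           "science", "science", "science", "students", "students", "teachers")
)

# Count each standards/word pair, then keep the 2 most frequent words per group
toy_top_words <- toy_tweets %>%
  count(standards, word, sort = TRUE) %>%
  group_by(standards) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

toy_top_words
```

Grouping before `slice_max()` is what makes the "top 2" apply within each set of standards rather than across the whole corpus.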

#### [Your Turn]{style="color: green;"} ⤵

Now use the code below to get the counts for our NGSS tweets so we can compare the top words for each set of standards:

```{r}
# your code here
```

What might the top words for each set of standards suggest about similarities and differences for how Twitter users talk about each? What might it suggest about public sentiment?

-   your response here

### 3b. Word Frequencies

We saw above that the word "students" is among the top words in both sets of standards, but to help facilitate comparisons, it is often helpful to look at how frequently each word occurs relative to all words in that document group. This will also help us better gauge how prominent each word is for each set of standards.

For example, let's take the CCSS word counts we created above and add a new column using the `mutate()` function that calculates the proportion each word makes up among all words in the CCSS tweets:

```{r}
ccss_frequencies <- ccss_counts %>%
  mutate(proportion = n / sum(n))

ccss_frequencies
```
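Proportions matter because raw counts depend on how many words each group contains. A minimal sketch of the same calculation done for both sets of standards at once is shown here; `toy_counts` and its numbers are invented for illustration and are not taken from our tweets.

```{r}
library(dplyr)

# Invented counts: "students" appears more often in the NGSS group in raw terms
toy_counts <- tribble(
  ~standards, ~word,      ~n,
  "ccss",     "math",     6,
  "ccss",     "students", 4,
  "ngss",     "science",  5,
  "ngss",     "students", 5
)

# group_by() makes sum(n) the total for each set of standards,
# so each proportion is relative to that group's own word total
toy_frequencies <- toy_counts %>%
  group_by(standards) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

toy_frequencies
```

Without `group_by()`, `sum(n)` would be the total across both groups, and the proportions would no longer be comparable between sets of standards.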

#### [Your Turn]{style="color: green;"} ⤵

Now use the code below to get the frequencies for our NGSS tweets so we can compare the top words for each set of standards:

```{r}
# your code here
```

We can see in both cases that our search terms are heavily skewing our proportions. What might we do to address this?

-   your response here

## 4. MODEL

As highlighted in [Chapter 3 of Data Science in Education Using R](https://datascienceineducation.com/c03.html), the **Model** step of the data science process entails "using statistical models, from simple to complex, to understand trends and patterns in the data." The authors note that while descriptive statistics and data visualization during the **Explore** step can help us identify patterns and relationships in our data, statistical models can help us determine whether those relationships, patterns, and trends are actually meaningful. In TM Learning Lab 3, we'll take a closer look at the study by @rosenberg2020 and how they used modeling to compare differences in sentiment between teachers and non-teachers discussing the NGSS.

## 5. COMMUNICATE

Congratulations - you've completed the first text mining learning lab! To complete your work, you can click the drop-down arrow at the top of the file, then select "Knit to HTML". This will create a report in your Files pane that serves as a record of your code and its output that you can open or share.

If you wanted, you could save the processed data set to your data folder. The `write_csv()` function is useful for this. The following code is set to not run, as we wanted to ensure that everyone had the data set needed to begin the second learning lab, but if you're confident in your prepared data, you can save it with the following:

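```{r, eval = FALSE}
# The file name below is only an example; adjust the path to match your own data folder
write_csv(ss_tidy_tweets, "data/ss_tidy_tweets.csv")
```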

## Reach (Optional)

If you're using data that you brought to the institute or data that you pulled from Twitter, try tidying your data into a tidy text format and examining the top words in your dataset.

If you'd like to use the data we've been working with for your reach, let's try some basic data visualization with text. The `wordcloud2` package is a dead-simple tool for generating HTML-based word clouds.

For example, let's load the `wordcloud2` library, and run the `wordcloud2()` function on our `ccss_counts` data frame:

```{r}
library(wordcloud2)

wordcloud2(ccss_counts)
```

As you can see, "math" is a pretty common topic when discussing the Common Core on Twitter, but words like "core" and "common" are not very helpful since those were in our search terms when pulling data from Twitter.

In a separate R script file, try modifying our list of stop words, retidying our text, and doing a new word count with these and perhaps other words removed. Also, take a look at the help file for `wordcloud2` to see if there are other ways you could improve this visualization.
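One possible way to approach the stop word step is sketched below. It is a sketch under assumptions rather than the lab's prescribed solution: `toy_words` and `custom_stop_words` are hypothetical names with invented values, and the pattern simply reuses `anti_join()` to drop any rows whose `word` appears in a custom list.

```{r}
library(dplyr)

# Invented tidy text standing in for the tokenized CCSS tweets
toy_words <- tibble(
  word = c("common", "core", "math", "math", "students", "ccss")
)

# Hypothetical custom stop word list made up of our search terms
custom_stop_words <- tibble(word = c("common", "core", "ccss"))

# Remove the custom stop words, then recount what remains
toy_clean_counts <- toy_words %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_clean_counts
```

The resulting data frame has the same `word` and `n` columns as `ccss_counts`, so a recount built this way could be passed to `wordcloud2()` in the same manner.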

Word clouds are much maligned and sometimes referred to as pie charts for words, but they can be useful for quickly summarizing and communicating qualitative data to education practitioners, who tend to find them intuitive to interpret. Also, for better or worse, they are now included as a default visualization for open-ended survey items in online Qualtrics reports.

## Reflection

In this learning lab, we focused on the literature guiding our analysis; wrangling our data into a one-token-per-row tidy text format; and using simple word counts and frequencies to compare common words used in tweets about the NGSS and CCSS curriculum standards. Below, add a few notes in response to the following prompts:

1.  One thing I took away from this learning lab:

    -   

2.  One thing I want to learn more about:

    -   

## References

Note: Citations embedded in R Markdown will only show upon knitting.
